Iterative Random Forests to detect predictive and stable high-order interactions

Authors

  • Sumanta Basu
  • Karl Kumbier
  • James B. Brown
  • Bin Yu
Abstract

Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that operate in vivo as components of larger molecular machines that regulate gene expression. Understanding these processes and the high-order interactions that govern them presents a substantial statistical challenge. Building on Random Forests (RF) and Random Intersection Trees (RIT), and through extensive, biologically inspired simulations, we developed iterative Random Forests (iRF). iRF leverages the Principle of Stability to train an interpretable ensemble of decision trees and detect stable, high-order interactions with the same order of computational cost as RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity for the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, iRF re-discovered the essential role of zelda (zld) in early zygotic enhancer activation, and identified novel third-order interactions, e.g. between zld, giant (gt), and twist (twi). In human-derived cells, iRF re-discovered that H3K36me3 plays a central role in chromatin-mediated splicing regulation, and identified novel 5th and 6th order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry in genome biology, automating hypothesis generation for the discovery of new molecular mechanisms from high-throughput, genome-wide datasets.

Similar resources

Iterative random forests to discover predictive and stable high-order interactions

Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge...

Stability of variable importance scores and rankings using statistical learning tools on single-nucleotide polymorphisms and risk factors involved in gene × gene and gene × environment interactions

Risk of complex disorders is thought to be multifactorial, involving interactions between risk factors. However, many genetic studies assess association between disease status and markers one single-nucleotide polymorphism (SNP) at a time, due to the high-dimensional nature of the search space of all possible interactions. Three ensemble methods have been recently proposed for use in high-dimen...

Random forests algorithm in podiform chromite prospectivity mapping in Dolatabad area, SE Iran

The Dolatabad area located in SE Iran is a well-endowed terrain owning several chromite mineralized zones. These chromite ore bodies are all hosted in a colored mélange complex zone comprising harzburgite, dunite, and pyroxenite. These deposits are irregular in shape, and are distributed as small lenses along colored mélange zones. The area has a great potential for discovering further chromite...

Comparison of Stability Parameters for Detection of Stable and High Essential Oil Yielding Landraces of Rosa damascena Mill.

The essential oil yield stability of damask rose (Rosa damascena Mill.) as an important medicinal and aromatic plant in different environments has not been well documented. In order to determine appropriate stability parameters, six statistics were studied for essential oil stability of 35 Rosa damascena landraces in seven locations (Sanandaj, Arak, Kashan, Dezful, Stahban, Ke...

Analysis of the spatial pattern and interactions of Persian oak and wild pistachio in the Ghalajeh forests of Kermanshah using the K2 function

Quercus brantii Lindl. and Pistacia atlantica Desf. are the most important tree species in Zagros forests. The abundant use of these trees by the inhabitants of the area has led to a reduction in the quality and quantity of these valuable species, as well as the creation of heterogeneous masses. Recognizing the spatial pattern and the interactions of trees can be a key to managerial interve...

Publication date: 2017